Welcome to the second part of the tutorial, where we are going to have a look at another popular social media platform: YouTube.

Packages to install

library(knitr)
#library(magick)
library(png)
library(tuber)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidytext)

library(grid)
#library(emo)
library(icon)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:icon':
## 
##     fa
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

options(stringsAsFactors = FALSE) 

Reference material for the following tutorial

Downloading YouTube data with tuber

Once you set up your YouTube access via the Google Developer Console, you can connect to the API and download data.

A reminder: to connect you use yt_oauth().

# replace these placeholders with the client ID and secret of your own Google Developer Console project
app_id = "YOUR_APP_ID.apps.googleusercontent.com" 
app_secret = "YOUR_APP_SECRET"

#connect
yt_oauth(app_id=app_id, app_secret=app_secret, token='', cache=FALSE)

By default the function looks for a .httr-oauth file in the working directory, in case you have connected before. If it doesn’t find one, it uses the application ID and secret you pass. If you do not want the function to use the cache, set cache=FALSE.

The function launches a browser so that you can authorize the application.
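Since the application ID and secret are credentials, one option is to keep them out of the script altogether. The sketch below reads them from environment variables. The variable names YT_APP_ID and YT_APP_SECRET are made up for this example, and the yt_oauth() call is left commented out because it opens a browser.

```r
# Read credentials from environment variables instead of hard-coding them.
# YT_APP_ID and YT_APP_SECRET are hypothetical names chosen for this sketch.
app_id <- Sys.getenv("YT_APP_ID")
app_secret <- Sys.getenv("YT_APP_SECRET")

# Sys.getenv() returns "" for unset variables, so check before connecting
credentials_ready <- nzchar(app_id) && nzchar(app_secret)

if (!credentials_ready) {
  message("Set YT_APP_ID and YT_APP_SECRET before connecting.")
}

# Uncomment to connect once the variables are set (launches a browser):
# tuber::yt_oauth(app_id = app_id, app_secret = app_secret)
```

The variables can be set for the current session with Sys.setenv() or stored in a .Renviron file.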

Once you are connected, we can start searching!

results <- yt_search("World Cup 2018")
kable(results[1:4,1:5])
video_id publishedAt channelId title description
qlZaiBuOaz4 2017-09-09T19:17:50.000Z UC07CHp5ikd-AyafR4J2LWkA No T20 World Cup In 2018, Australia Will host Next WT20 in 2020 It is now officially confirmed by ICC that the 2018 season of ICC WT20 Will not gonna happen,Australia now awarded as the next destination of icc t20 world cup …
sJSH7SM1La0 2018-07-06T18:00:03.000Z UCH0lge-0lFszfPRqFkaoXpw Brasile Vs Belgio “Calci di Rigore” World CUP 2018 Quarti di Finale | PES 2018 Patch [Giù] PES 2018 - Stadium Pack v2 All AIO Download …
Ff554RwjHi0 2018-01-21T11:03:01.000Z UCoNOFr0J8S2yMmX3lB2gs8g ICC World T20 2018 Schedule, Time Table, Venue, Host Country, Fixtures, News ICC World T20 2018 Schedule, All Teams, Time Table, Venue, Host Country, Groups, Fixtures, Important News, Theme and World T20 2018 logo.
aWxbphbmDts 2018-01-22T15:48:13.000Z UCqMp3kP5EP0YXANgY-I43JA T20 World Cup 2018 Full Schedule And Time Table ICC World 20 20 Cup ICC T20 World Cup 2018 Full Schedule And Time Table in urdu hindi ICC t20 World Cup 2018 ICC t20 World Cup full time table Hind Pakistan Vs India T20 …

The yt_search function returns a data.frame with 16 columns, including video_id, publishedAt, channelId, title and description.

The yt_search function takes parameters that let you refine your search. The ones used later in this tutorial are term, type and channel_id.

Now let’s have a look at the comments on a video. We can pick any video from our search and use its video_id to get them.

results <- get_comment_threads(c(video_id="qlZaiBuOaz4"))
kable(results[1:10, c("authorDisplayName", "textDisplay")])
authorDisplayName textDisplay
Suraj Yadav Sale jhooth bolta hai faltu channal ko subscribe nhi kiya jata
Meena Srivastava Nice video bro
bdayal singh My favorite tournament is T20 world cup
DHRUV SAWANT 2020 me hoga
Nikhil Yadav GU hai TU guuu
Vandana2 Rajpurohit yash bhai app tho ipl ke hi video banate ho aap world cricket ke upar bhi video banao
Arjun Bhandari Ipl k bareme batao jiii rcb k bareme kya change hoga
Vikas Mathur Bhosdee ke tuje bada pata h
Munmun Samanta chip Kar faltu
Mohammed Rafique Asia cup kab hoga bhai

You specify your video id in the filter argument. You can also use this argument to download comments for a specific channel. So, let’s have a look at your favourite YouTube channel. Which one is your favourite? Mine is Bloomberg.

But… I guess… it’s getting too busy for the day, so let’s take a break and have a look at Victoria’s Secret.

It’s Brisbane, Australia after all!

To locate the channel we need to use its channel id, not its name.

Get the user/channel id

Obtaining data from a YouTube channel, such as its stats, is usually done through the channel_id, which is not the same as the YouTube name you see in the YouTube link.

There are several ways to obtain the channel_id from the YouTube name:

  • Use the YouTube Data API: https://www.googleapis.com/youtube/v3/channels?key={YOUR_API_KEY}&forUsername={USER_NAME}&part=id, where YOUR_API_KEY is the key you create in the Google developer account and USER_NAME is the YouTube channel (username).

  • Use the page source code: open the YouTube page in the browser and view the page source. Search for either externalId or data-channel-external-id. The value there will be the channel id.
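The first option can be sketched in base R by filling in the URL template given above. The helper name build_channel_lookup_url is made up for this example, "YOUR_API_KEY" is a placeholder, and the request itself is not sent here.

```r
# Build the Data API lookup URL from the template above.
# build_channel_lookup_url() is a hypothetical helper, not part of tuber.
build_channel_lookup_url <- function(user_name, api_key) {
  sprintf(
    "https://www.googleapis.com/youtube/v3/channels?key=%s&forUsername=%s&part=id",
    utils::URLencode(api_key, reserved = TRUE),   # escape characters that are unsafe in a URL
    utils::URLencode(user_name, reserved = TRUE)
  )
}

# "YOUR_API_KEY" is a placeholder; VICTORIASSECRET is the channel username used in this tutorial
lookup_url <- build_channel_lookup_url("VICTORIASSECRET", "YOUR_API_KEY")
print(lookup_url)

# Opening this URL with a valid key returns JSON. The channel id is expected
# in the id field of the returned item (an assumption about the response shape).
```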

Downloading channel stats

Now that we have the channel id, let’s get a list of its videos and have a look at its stats.

channel_id<-"UChWXY0e-HUhoXZZ_2GlvojQ"
videosVS = yt_search(term="", type="video", channel_id = channel_id)
kable(videosVS[1:4,1:5])
video_id publishedAt channelId title description
HLo_3GfHjps 2012-11-30T17:10:00.000Z UChWXY0e-HUhoXZZ_2GlvojQ Miranda Kerr and Bruno Mars Backstage at the 2012 Victoria’s Secret Fashion Show Get ready for dimples, winks and giggles when Miranda Kerr quizzes Bruno Mars backstage at the 2012 Victoria’s Secret Fashion Show. Catch them both on the …
76gf3YeGf1s 2010-11-30T17:49:42.000Z UChWXY0e-HUhoXZZ_2GlvojQ Before I Was a Supermodel: Behati Prinsloo On the eve of the 2010 Fashion Show, Supermodel Behati Prinsloo shares some memories about growing up and becoming a model.
wMYYQkwLDqM 2013-10-17T15:27:54.000Z UChWXY0e-HUhoXZZ_2GlvojQ Candice Swanepoel Meets the Royal Fantasy Bra Go on set as Victoria’s Secret Angel Candice Swanepoel gets to see and try on the Royal Fantasy Bra for the very first time. Look for Candice and the $10 million …
DkcctlhyWEs 2015-08-22T12:55:46.000Z UChWXY0e-HUhoXZZ_2GlvojQ Romee Strijd on Becoming a Victoria’s Secret Angel Dutch supermodel Romee Strijd talks about how she was discovered, being cast for the Victoria’s Secret Fashion Show and her journey to becoming an Angel.
#get channel stats
statsVS<-get_channel_stats(channel_id=channel_id)
## Channel Title: Victoria's Secret 
## No. of Views: 265955120 
## No. of Subscribers: 1504923 
## No. of Videos: 743
statsVSSelected <- as.vector(statsVS$statistics)
results<-do.call(rbind, statsVSSelected)
head(results)
##                       [,1]       
## viewCount             "265955120"
## commentCount          "0"        
## subscriberCount       "1504923"  
## hiddenSubscriberCount "FALSE"    
## videoCount            "743"

The get_channel_stats function is quite straightforward. It takes channel_id as an argument and returns a nested list. We can select the items we need from the list and convert them to a data.frame.
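As a minimal sketch of that conversion, the block below uses a hand-built list that mimics the statistics element printed above, so it runs without an API connection. The assumption is that the real element is a flat named list of strings, as the printed output suggests.

```r
# Stand-in for statsVS$statistics, using the values printed above.
# Assumption: the real element is a flat named list of character values.
stats_list <- list(
  viewCount = "265955120",
  commentCount = "0",
  subscriberCount = "1504923",
  hiddenSubscriberCount = "FALSE",
  videoCount = "743"
)

# One row, one column per statistic
stats_df <- as.data.frame(stats_list, stringsAsFactors = FALSE)

# The counts arrive as text, so convert them before doing arithmetic
count_cols <- c("viewCount", "commentCount", "subscriberCount", "videoCount")
stats_df[count_cols] <- lapply(stats_df[count_cols], as.numeric)
stats_df$hiddenSubscriberCount <- as.logical(stats_df$hiddenSubscriberCount)

str(stats_df)
```

Compared with do.call(rbind, ...), which yields a one-column character matrix, this keeps each statistic in its own typed column.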

Downloading stats for videos

Now that we have a list of videos from the VS channel, let’s download stats for each video. As an example, let’s do the first 10.

videosVS_sample<-videosVS[1:10,]

videoStatsVS = lapply(as.character(videosVS_sample$video_id), function(x){
  get_stats(video_id = x)
})
videoStatsVS_df = do.call(rbind.data.frame, videoStatsVS)
head(videoStatsVS_df)
##             id viewCount likeCount dislikeCount favoriteCount commentCount
## 2  HLo_3GfHjps   2260468     16320          277             0         1826
## 21 76gf3YeGf1s    885113      5463           72             0          299
## 3  wMYYQkwLDqM   2490444     17828          299             0         1305
## 4  DkcctlhyWEs    954050      9356          202             0          402
## 5  anhEcKWRzyE   2167444     37678          569             0         1674
## 6  LwelJNogl-M    883141      7715           88             0          200
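The loop-and-bind pattern above can be tried offline. In the sketch below, mock_get_stats is a made-up stand-in for get_stats that returns a list with fixed values, so only the pattern is illustrated and not the API call.

```r
# Hypothetical stand-in for tuber::get_stats(): returns one list per video.
# The zero counts are dummy values, not real statistics.
mock_get_stats <- function(video_id) {
  list(id = video_id, viewCount = "0", likeCount = "0")
}

video_ids <- c("HLo_3GfHjps", "76gf3YeGf1s", "wMYYQkwLDqM")

# One list per video id
stats_per_video <- lapply(video_ids, mock_get_stats)

# Turn each list into a one-row data.frame, then stack the rows
stats_df <- do.call(
  rbind,
  lapply(stats_per_video, as.data.frame, stringsAsFactors = FALSE)
)

print(stats_df)
```

Converting each element with stringsAsFactors = FALSE before binding keeps the columns as character rather than factor.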

The function takes video ids but does not return video titles or dates, so we add them ourselves and do some clean-up.

videoStatsVS_df$title = videosVS_sample$title
# yt_search() returns the upload date as publishedAt; there is no column called date
videoStatsVS_df$date = videosVS_sample$publishedAt

library(tidyverse)
videoStatsVS_df = as_tibble(videoStatsVS_df) %>%
  mutate(viewCount = as.numeric(as.character(viewCount)), #originally as factor
         likeCount = as.numeric(as.character(likeCount)),
         dislikeCount = as.numeric(as.character(dislikeCount)),
         commentCount = as.numeric(as.character(commentCount)))

head(videoStatsVS_df)
## # A tibble: 6 x 7
##   id     viewCount likeCount dislikeCount favoriteCount commentCount title
##   <chr>      <dbl>     <dbl>        <dbl> <chr>                <dbl> <chr>
## 1 HLo_3…   2260468     16320          277 0                     1826 Mira…
## 2 76gf3…    885113      5463           72 0                      299 Befo…
## 3 wMYYQ…   2490444     17828          299 0                     1305 Cand…
## 4 Dkcct…    954050      9356          202 0                      402 Rome…
## 5 anhEc…   2167444     37678          569 0                     1674 Road…
## 6 LwelJ…    883141      7715           88 0                      200 Vict…
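The as.numeric(as.character(...)) pattern used above matters when the counts arrive as factors. A small sketch with three of the view counts shows the difference; the vector is built by hand for illustration.

```r
# View counts stored as a factor, as in the "originally as factor" comment above
view_count <- factor(c("2260468", "885113", "2490444"))

# as.numeric() on a factor returns the internal level codes, not the counts
level_codes <- as.numeric(view_count)

# Going through character first recovers the actual numbers
counts <- as.numeric(as.character(view_count))

print(level_codes)
print(counts)
```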

I ran the function on the full list of video ids for the channel; you can use the resulting file, videoStatsVS_df.csv, in the YouTube tutorial-data folder. Let’s load it.

videoStatsVS_df <- as.data.table(read.csv("YouTube tutorial-data/videoStatsVS_df.csv", stringsAsFactors=FALSE))
head(videoStatsVS_df)
##    X          id viewCount likeCount dislikeCount favoriteCount
## 1: 1 HLo_3GfHjps   2259491     16317          277             0
## 2: 2 76gf3YeGf1s    884766      5463           72             0
## 3: 3 wMYYQkwLDqM   2490140     17827          299             0
## 4: 4 DkcctlhyWEs    953821      9353          202             0
## 5: 5 anhEcKWRzyE   2165304     37656          569             0
## 6: 6 LwelJNogl-M    883017      7711           88             0
##    commentCount
## 1:         1826
## 2:          299
## 3:         1305
## 4:          402
## 5:         1673
## 6:          200
##                                                                               title
## 1: Miranda Kerr and Bruno Mars Backstage at the 2012 Victoria's Secret Fashion Show
## 2:                                       Before I Was a Supermodel: Behati Prinsloo
## 3:                                    Candice Swanepoel Meets the Royal Fantasy Bra
## 4:                               Romee Strijd on Becoming a Victoria’s Secret Angel
## 5:                                         Road to the Runway: Episode 1 – Castings
## 6:                                            Victoria’s Secret Angel Outtakes 2014

Let’s see which video was most popular:

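The videos with the most views are:

# rank by viewCount explicitly: without wt, top_n() ranks by the last column of the table
videoStatsVS_df %>%
  top_n(n = 5, wt = viewCount) %>%
  arrange(desc(viewCount)) %>%
  select(title, viewCount, likeCount, favoriteCount, commentCount, id)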
##                                    title viewCount likeCount favoriteCount
## 1 Why Alessandra Ambrosio loves her body    773575      3250             0
## 2                        You've Got Male    422072       881             0
## 3    Wild & Beautiful: Behind the Scenes    338447      1115             0
## 4   Why Lindsay Ellingson loves her body    178581       320             0
## 5    Why Emanuela DePaula loves her body    118804       339             0
##   commentCount          id
## 1           23 EMOxRWd-Q1k
## 2           19 drxva8UM-zM
## 3           17 nmTT42Q0bsI
## 4           17 B3QSmqcTojQ
## 5           11 jANpsZGKPXA

Let’s have a look at it!

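The videos with the most likes are:

# rank by likeCount explicitly: without wt, top_n() ranks by the last column of the table
videoStatsVS_df %>%
  top_n(n = 5, wt = likeCount) %>%
  arrange(desc(likeCount)) %>%
  select(title, likeCount, viewCount, favoriteCount, commentCount, id)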
##                                    title likeCount viewCount favoriteCount
## 1 Why Alessandra Ambrosio loves her body      3250    773575             0
## 2    Wild & Beautiful: Behind the Scenes      1115    338447             0
## 3                        You've Got Male       881    422072             0
## 4    Why Emanuela DePaula loves her body       339    118804             0
## 5   Why Lindsay Ellingson loves her body       320    178581             0
##   commentCount          id
## 1           23 EMOxRWd-Q1k
## 2           17 nmTT42Q0bsI
## 3           19 drxva8UM-zM
## 4           11 jANpsZGKPXA
## 5           17 B3QSmqcTojQ

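The videos with the most comments are:

# rank by commentCount explicitly: without wt, top_n() ranks by the last column of the table
videoStatsVS_df %>%
  top_n(n = 5, wt = commentCount) %>%
  arrange(desc(commentCount)) %>%
  select(title, commentCount, viewCount, likeCount, favoriteCount, id)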
##                                    title commentCount viewCount likeCount
## 1 Why Alessandra Ambrosio loves her body           23    773575      3250
## 2                        You've Got Male           19    422072       881
## 3    Wild & Beautiful: Behind the Scenes           17    338447      1115
## 4   Why Lindsay Ellingson loves her body           17    178581       320
## 5    Why Emanuela DePaula loves her body           11    118804       339
##   favoriteCount          id
## 1             0 EMOxRWd-Q1k
## 2             0 drxva8UM-zM
## 3             0 nmTT42Q0bsI
## 4             0 B3QSmqcTojQ
## 5             0 jANpsZGKPXA
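The ranking step can also be sketched in base R: sort by the column of interest in descending order, then keep the first rows. The toy table below reuses the six rows printed by head() when the file was loaded, so it only illustrates the mechanics and not the channel-wide result.

```r
# The six rows shown by head(videoStatsVS_df) after loading the csv file
toy_stats <- data.frame(
  id = c("HLo_3GfHjps", "76gf3YeGf1s", "wMYYQkwLDqM",
         "DkcctlhyWEs", "anhEcKWRzyE", "LwelJNogl-M"),
  viewCount = c(2259491, 884766, 2490140, 953821, 2165304, 883017),
  likeCount = c(16317, 5463, 17827, 9353, 37656, 7711),
  stringsAsFactors = FALSE
)

# Sort by views, largest first, and keep the top three rows
top_by_views <- head(toy_stats[order(-toy_stats$viewCount), ], 3)

# The same pattern works for any other numeric column
top_by_likes <- head(toy_stats[order(-toy_stats$likeCount), ], 3)

print(top_by_views)
print(top_by_likes)
```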

Downloading titles and comments

Let’s have a look at titles now and see what we can find there.

Continuing with the “girl power” theme, let’s compare VS to another powerhouse, US Vogue.

Following the procedure described earlier, I downloaded video stats for both the VS and US Vogue channels and merged them into one file, videostats_All.csv.

Further analysis will involve text manipulation, so we will need the tidyverse and tidytext packages.

library(tidyverse)
library(tidytext)

You can either download the channel stats yourself (see above) or use videostats_All.csv.

Channel ids are:

  • Americanvogue = UCRXiA3h1no_PFkb1JCP0yMA

  • VICTORIASSECRET = UChWXY0e-HUhoXZZ_2GlvojQ
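If you download the stats yourself, the two tables need to be combined with a column marking the channel. A minimal sketch, assuming each channel's stats are already in a data.frame with the same columns (the two-row tables below are hand-built from rows shown in this tutorial):

```r
# Two small per-channel tables standing in for the full downloads
vogue_stats <- data.frame(
  video_id = c("SMX60fh5u7o", "r-shJpCflvQ"),
  viewCount = c(29346, 59103),
  stringsAsFactors = FALSE
)
vs_stats <- data.frame(
  video_id = c("HLo_3GfHjps", "76gf3YeGf1s"),
  viewCount = c(2259491, 884766),
  stringsAsFactors = FALSE
)

# Label each table with its channel before stacking
vogue_stats$source <- "Vogue"
vs_stats$source <- "VS"

# rbind() requires the same column names in both tables
merged_stats <- rbind(vogue_stats, vs_stats)
print(merged_stats)

# Uncomment to save the merged table:
# write.csv(merged_stats, "videostats_All.csv", row.names = FALSE)
```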

To load the existing file, let’s do this:

videostats_All <- as.data.table(read.csv("YouTube tutorial-data/videostats_All.csv", stringsAsFactors=FALSE))
head(videostats_All)
##          date
## 1: 2016-01-05
## 2: 2016-01-18
## 3: 2016-01-19
## 4: 2016-01-22
## 5: 2016-02-02
## 6: 2016-02-11
##                                                                    title
## 1:                   Inside the Brooklyn Home of Artist Mickalene Thomas
## 2:                      Watch Irene Kim’s Japanese Hot Springs Adventure
## 3:                                     73 Questions With Derek Zoolander
## 4: Watch Pop Star in the Making Zella Day Perform a Heartbreaking Ballad
## 5:                       Cleaning House With Organizing Guru Marie Kondo
## 6:      Here’s What It’s Like to Be Lucky Blue Smith During Fashion Week
##       video_id viewCount likeCount dislikeCount commentCount source
## 1: SMX60fh5u7o     29346      1304            7           27  Vogue
## 2: r-shJpCflvQ     59103      1206           29           38  Vogue
## 3: H4q0K561WXs   2395735     34455         1817         1754  Vogue
## 4: wg19p1cDSRE     58920      2104            9          101  Vogue
## 5: z3OXvQZe7g8    392529      4121          870          171  Vogue
## 6: LIKTr9vXRQI    141677      2853           52          131  Vogue

Let’s have a brief look at the data. We are going to use the stargazer package, which is fantastic for generating “academic” looking results, and the describeBy function from the psych package, which generates statistics by a grouping variable. We will group the variables by channel.

library(stargazer)

stargazer(videostats_All[,.(viewCount, likeCount, commentCount, dislikeCount)], median=TRUE, digits=1, type = "text")
## 
## ============================================================================================
## Statistic     N      Mean        St. Dev.     Min  Pctl(25)  Median    Pctl(75)      Max    
## --------------------------------------------------------------------------------------------
## viewCount    456 1,199,256.000 2,452,764.000 6,477 115,134  367,336.5 1,082,360.0 20,665,383
## likeCount    456  25,219.360    66,559.910    145   1,992    5,040.5    18,092     747,921  
## commentCount 455   1,150.145     3,401.707   4.000  85.500   202.000    707.500   31,101.000
## dislikeCount 456    727.618      2,297.890     1      33      88.5        425       29,639  
## --------------------------------------------------------------------------------------------
## 
## =
## 1
## -
library(psych)

results<-describeBy(videostats_All[, .(viewCount, likeCount, commentCount, dislikeCount)], 
          group=videostats_All$source, digits=1, mat=TRUE) 

results[,c(1:7, 10:11)]
##               item group1 vars   n      mean        sd median   min
## viewCount1       1  Vogue    1 307 1512729.9 2874266.3 478164  6477
## viewCount2       2     VS    1 149  553373.0  889076.4 208598 31747
## likeCount1       3  Vogue    2 307   35448.6   79096.3  10367   145
## likeCount2       4     VS    2 149    4143.1    4528.7   2468   722
## commentCount1    5  Vogue    3 306    1611.7    4067.6    359     4
## commentCount2    6     VS    3 149     202.3     235.9    111    17
## dislikeCount1    7  Vogue    4 307    1042.1    2745.7    211     1
## dislikeCount2    8     VS    4 149      79.7     136.8     43     5
##                    max
## viewCount1    20665383
## viewCount2     5348750
## likeCount1      747921
## likeCount2       37661
## commentCount1    31101
## commentCount2     1673
## dislikeCount1    29639
## dislikeCount2     1189

Just a reminder that:

  • mean: average

  • st. dev: a measure of variation in the data compared to the average

  • min and max: extreme values
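To make these statistics concrete, the sketch below computes them per channel in base R on a six-row toy table. The view counts are taken from rows printed earlier in this tutorial, so the numbers are not the channel-wide figures.

```r
# Three Vogue and three VS view counts taken from the head() outputs above
toy_views <- data.frame(
  source = c("Vogue", "Vogue", "Vogue", "VS", "VS", "VS"),
  viewCount = c(29346, 59103, 2395735, 2259491, 884766, 2490140),
  stringsAsFactors = FALSE
)

# The four statistics from the reminder above
summarise_counts <- function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

# split() groups the counts by channel; sapply() returns one column per channel
by_source <- sapply(split(toy_views$viewCount, toy_views$source), summarise_counts)

print(by_source)
```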

It is likely that the number of views relates to the number of likes: the more people view a video, the more of them “like” it. We can test this with a correlation, using the corr.test function from the same psych package.

results<-corr.test(videostats_All[, .(viewCount, likeCount, commentCount, dislikeCount)], use = "complete",method="pearson",adjust="holm",
          alpha=.05,ci=FALSE)
results$r
##              viewCount likeCount commentCount dislikeCount
## viewCount    1.0000000 0.8130140    0.8078345    0.8286073
## likeCount    0.8130140 1.0000000    0.9469651    0.6756837
## commentCount 0.8078345 0.9469651    1.0000000    0.7457451
## dislikeCount 0.8286073 0.6756837    0.7457451    1.0000000
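For a quick check without psych, base R's cor() gives the same kind of Pearson correlation matrix. The sketch below runs it on the six Vogue rows printed by head(videostats_All), so the coefficients will differ from the full-data values above.

```r
# The six rows shown by head(videostats_All)
toy_counts <- data.frame(
  viewCount = c(29346, 59103, 2395735, 58920, 392529, 141677),
  likeCount = c(1304, 1206, 34455, 2104, 4121, 2853),
  commentCount = c(27, 38, 1754, 101, 171, 131)
)

# Pairwise Pearson correlations, using only complete rows
r_matrix <- cor(toy_counts, use = "complete.obs", method = "pearson")

print(round(r_matrix, 3))
```

With so few rows, a single very popular video dominates the result, so this is only a check of the mechanics.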

Or we can plot it on a graph with the help of ggplot2 and gridExtra.

library (ggplot2)
library(gridExtra)
p1=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)
## Warning: Removed 1 rows containing missing values (geom_point).

You just cannot not LOVE Adriana Lima! But let’s move on…

As you can see, the title column describes the video. It is logical to assume that the title is the first thing to attract the viewer’s attention. Let’s have a closer look and see if we can identify specific words.

Let’s tokenize the titles, remove stop words and calculate word frequencies. In the code below, the frequency of a word is the number of times it appears in a channel’s titles divided by the number of videos in that channel.

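title_words_All_Source<-videostats_All %>%
  as_tibble() %>% 
  unnest_tokens(word, title) %>%           # one row per word in each title
  anti_join(stop_words) %>%                # drop common stop words
  count(source, word, sort = TRUE) %>%     # n = occurrences of each word per channel
  left_join(videostats_All %>% 
              group_by(source) %>% 
              summarise(total = n())) %>%  # total = number of videos per channel
  mutate(freq = n/total) 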
## Joining, by = "word"
## Joining, by = "source"
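To see what the pipeline does step by step, here is a base R sketch of the same idea on four titles from this tutorial. The short stop word vector is made up for the example and is much smaller than tidytext's stop_words, and the simple regular expression only approximates what unnest_tokens does.

```r
# Four titles taken from the tables above
titles <- data.frame(
  source = c("Vogue", "Vogue", "VS", "VS"),
  title = c("73 Questions With Derek Zoolander",
            "Watch Pop Star in the Making Zella Day Perform a Heartbreaking Ballad",
            "Before I Was a Supermodel: Behati Prinsloo",
            "Candice Swanepoel Meets the Royal Fantasy Bra"),
  stringsAsFactors = FALSE
)

# Hypothetical mini stop word list, for illustration only
small_stop_words <- c("a", "i", "in", "the", "was", "with", "before")

# Tokenize: lower-case, split on non-alphanumerics, drop stop words
tokens <- do.call(rbind, lapply(seq_len(nrow(titles)), function(i) {
  words <- strsplit(tolower(titles$title[i]), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words) & !(words %in% small_stop_words)]
  data.frame(source = titles$source[i], word = words, stringsAsFactors = FALSE)
}))

# Count each word per channel and keep only the words that occur
word_counts <- as.data.frame(table(source = tokens$source, word = tokens$word),
                             stringsAsFactors = FALSE)
word_counts <- word_counts[word_counts$Freq > 0, ]
names(word_counts)[names(word_counts) == "Freq"] <- "n"

# Divide by the number of videos in the channel
videos_per_source <- table(titles$source)
word_counts$freq <- word_counts$n / as.numeric(videos_per_source[word_counts$source])

print(word_counts[order(-word_counts$freq), ])
```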